# Loading packages
library(pryr) # Memory usage information
library(lubridate) # date manipulation
Attaching package: ‘lubridate’
The following object is masked from ‘package:base’:
date
library(dplyr) # data manipulation
Attaching package: ‘dplyr’
The following objects are masked from ‘package:lubridate’:
intersect, setdiff, union
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(tidyr) # data manipulation
library(ggplot2) # graphics
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
library(stringr)
library(stringi) # remove diacritics
library(ggfortify)
library(rjson)
library(nasapower)
There are 4 csv files containing data about the fires of all ocurrences around all brazilian territory. Each file has data between 1st Sept to 30th Sept of each year from 2015 to 2019.
temp <- list.files(path = "./database" ,pattern = "*.csv")
myfiles = lapply(paste0("./database/",temp), read.csv)
nrecords <- lapply(myfiles, nrow)
data <- bind_rows(myfiles)
Unequal factor levels: coercing to characterbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorUnequal factor levels: coercing to characterbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorUnequal factor levels: coercing to characterbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vector
names(data)
[1] "datahora" "satelite" "pais" "estado" "municipio" "bioma"
[7] "diasemchuva" "precipitacao" "riscofogo" "latitude" "longitude" "frp"
# As we have many data records, we'll be working with a small subset of it
df <- data[sample(nrow(data), 0.1*nrow(data)), ]
totalNo <- nrow(df)
object_size(df)
8.32 MB
there are 820929 observations in our dataset which sums up to 66.8 MB of storaged data.
head(df)
str(df)
'data.frame': 88511 obs. of 12 variables:
$ datahora : chr "2018/09/07 17:45:00" "2017/08/31 16:34:00" "2017/09/06 17:35:00" "2017/09/24 17:24:00" ...
$ satelite : Factor w/ 1 level "AQUA_M-T": 1 1 1 1 1 1 1 1 1 1 ...
$ pais : Factor w/ 1 level "Brasil": 1 1 1 1 1 1 1 1 1 1 ...
$ estado : chr "RONDONIA" "MARANHAO" "MATO GROSSO" "PARA" ...
$ municipio : chr "MONTE NEGRO" "CAROLINA" "NOVA BANDEIRANTES" "PLACAS" ...
$ bioma : Factor w/ 6 levels "Amazonia","Caatinga",..: 1 3 1 1 3 1 1 3 3 1 ...
$ diasemchuva : int 6 0 0 0 NA 2 7 NA 21 NA ...
$ precipitacao: num 0 0 0 0.9 NA 1.37 0 0 0.66 0 ...
$ riscofogo : num 1 1 0.75 0.68 NA 0.64 1 1 0.44 1 ...
$ latitude : num -10.01 -7.75 -9.53 -3.62 -13.03 ...
$ longitude : num -63.5 -46.8 -57.9 -54.3 -47.5 ...
$ frp : num 9 NA NA NA NA 17.8 15.5 NA 46 NA ...
Aparentemente, A coluna de área industrial e FRP tem poucas entradas não nulas
paste("Percentage of NAs entries on `FRP` =",
100*sum(is.na(df$frp))/totalNo)
[1] "Percentage of NAs entries on `FRP` = 68.8727954717493"
wetDays <- df$diasemchuva <= 10
dryDays <- df$diasemchuva > 10
noWeatherInfo <- is.na(df$diasemchuva)
paste("Percentage of NAs entries on `diasemchuva` =",
round(100*(sum(noWeatherInfo))/totalNo,2))
[1] "Percentage of NAs entries on `diasemchuva` = 45.43"
paste("Percentage of records with less than (or equal to) 10 days of no rain =",
round(100*(sum(wetDays, na.rm = T))/totalNo,2))
[1] "Percentage of records with less than (or equal to) 10 days of no rain = 42.72"
paste("Percentage of records with more than 10 days of no rain =",
round(100*(sum(dryDays, na.rm = T))/totalNo,2))
[1] "Percentage of records with more than 10 days of no rain = 11.85"
By the last results, it is possible that the number of days with no rain have a huge impact over the fires triggers.
we can also investigate another columns that have either redundant data or not useful data.
levels(df$pais)
[1] "Brasil"
levels(df$satelite)
[1] "AQUA_M-T"
df$pais <- NULL
df$satelite <- NULL
object_size(df)
7.61 MB
By looking at the structure of our dataframe, it becomes clear that the Datahora is a datetime object not a Factor.
df$datahora <- ymd_hms(str_replace_all(df$datahora,"/","-"))
levels(df$bioma)
[1] "Amazonia" "Caatinga" "Cerrado" "Mata Atlantica" "Pampa" "Pantanal"
We have 6 distincts biomas on our dataset. Perhaps, by knowing how is the distribution of fires among different biomas, we can refine our research.
counts <- table(df$bioma)
counts
Amazonia Caatinga Cerrado Mata Atlantica Pampa Pantanal
43954 5559 28766 7436 457 2339
proportions <- 100*counts / sum(counts)
proportions
Amazonia Caatinga Cerrado Mata Atlantica Pampa Pantanal
49.659364 6.280575 32.499915 8.401216 0.516320 2.642609
As we can see, Amazonia represents around half of our dataset. In addition, the number of fire occurrences is very much like the proportion of land that each bioma has in brazilian territory.
————————– Visualisations —————————–
ggplot(df) +
geom_bar(aes(x = bioma, y = 100*..prop.., group = 1)) +
coord_flip()

Now, it is also important to get the underlying trend of the number of fire ocurrences through the months of the year. Let’s start exploring this idea:
df %>%
mutate(month = floor_date(date(datahora), "month")) %>%
group_by(month, bioma) %>%
summarise(n_fires = n()) %>%
ggplot(aes(x = month, y = n_fires/1000)) +
geom_bar(aes(fill = bioma), stat = 'identity') +
scale_x_date(date_labels = "%b/%y", date_breaks = "6 months") +
labs(y = "Number of fires in thousands", title = 'Brazilian Wildfires by Year')

In order to plot the number of fires by state, we must do some data wrangling to put the data in the desired format.
is.odd <- function(x) x %% 2 != 0
is.even <- function(x) x %% 2 == 0
geo_data <- fromJSON(file = "./geo_data/geo_data.json")
maps_df <- list()
for (i in 1:length(geo_data$features)) {
coordinates <- unlist(geo_data$features[[i]]$geometry$coordinates[[1]])
long <- coordinates[is.odd(seq_along(coordinates))]
lat <- coordinates[is.even(seq_along(coordinates))]
if(length(long) != length(lat)){
stop("Wrong subset of coordinate vector", call. = FALSE)
}
geo_length <- length(long)
estado <- toupper(geo_data$features[[i]]$properties$name)
codigo <- as.integer(geo_data$features[[i]]$properties$codigo_ibg)
regiao <- as.integer(geo_data$features[[i]]$properties$regiao_id)
maps_df[[i]] <- as_tibble(
list(
estado = rep(stri_trans_general(estado,"Latin-ASCII"), geo_length),
codigo = rep(codigo, geo_length),
regiao = rep(regiao, geo_length),
longitude = long,
latitude = lat
)
)
}
map_df <- bind_rows(maps_df)
# the `estado` column of the map dataframe is a character, so in order
# to avoid coertion from factor to character, let's convert it explicitly
df$estado <- as.character(df$estado)
df %>%
select(estado) %>%
group_by(estado) %>%
summarise(n = n()) %>%
right_join(map_df, by = 'estado') %>%
ggplot(aes(x = longitude, y = latitude, group = codigo, fill = n/1000)) +
geom_polygon() +
geom_path(color = 'white', size = 0.1) +
scale_fill_continuous(low = "orange",
high = "darkred",
name = 'Number of fires\n(thousands)'
)

As we can see, the state with the most fire ocurrences is Pará in which most of its lands belongs to the amazon ecossystem, this may explain why half the dataset ocurrences comes from Amazon fires.
As we can see below, this trend happens all years.
df %>%
select(estado, datahora) %>%
mutate(ano = year(datahora)) %>%
group_by(estado, ano) %>%
summarise(n = n()) %>%
right_join(map_df, by = 'estado') %>%
ggplot(aes(x = longitude, y = latitude, group = codigo, fill = n/1000)) +
geom_polygon() +
facet_wrap(~ano) +
geom_path(color = 'white', size = 0.1) +
scale_fill_continuous(low = "orange",
high = "darkred",
name = 'Number of fires\n(thousands)'
)

In terms of raw numbers, we can visualize this difference below.
df %>%
group_by(estado) %>%
ggplot(aes(x = factor(estado))) +
geom_bar() +
coord_flip()

Going down one level, let’s try an analysis by county
geo_data_counties <- fromJSON(file = "./geo_data/geo_data_counties.json")
maps_df_counties <- list()
for (i in 1:length(geo_data_counties$features)) {
coordinates <- unlist(geo_data_counties$features[[i]]$geometry$coordinates[[1]])
long <- coordinates[is.odd(seq_along(coordinates))]
lat <- coordinates[is.even(seq_along(coordinates))]
if(length(long) != length(lat)){
stop("Wrong subset of coordinate vector", call. = FALSE)
}
geo_length <- length(long)
municipio <- geo_data_counties$features[[i]]$properties$name
municipio <- toupper(stri_trans_general(municipio, "Latin-ASCII"))
codigo <- as.integer(geo_data_counties$features[[i]]$properties$id)
maps_df_counties[[i]] <- as_tibble(
list(
municipio = rep(stri_trans_general(municipio, "Latin-ASCII"), geo_length),
codigo = rep(codigo, geo_length),
longitude = long,
latitude = lat
)
)
}
map_df_county <- bind_rows(maps_df_counties)
df %>%
select(municipio) %>%
group_by(municipio) %>%
summarise(n = n()) %>%
right_join(map_df_county, by = 'municipio') %>%
ggplot(aes(x = longitude, y = latitude, group = codigo, fill = n)) +
geom_polygon() +
geom_path(color = 'white', size = 0.1) +
scale_fill_continuous(low = "orange",
high = "darkred",
name = 'Number of fires'
)

What about that total number of fire ocurrences by state? From when does it come from?
df %>%
mutate(ano = year(datahora)) %>%
group_by(estado, ano) %>%
ggplot(aes(x = factor(estado), fill = factor(ano))) +
geom_bar() +
coord_flip()

Let’s repeat the last plot, but now considering the proportions
df %>%
mutate(ano = year(datahora)) %>%
group_by(estado, ano) %>%
ggplot(aes(x = factor(estado), fill = factor(ano))) +
geom_bar(position = "fill") +
coord_flip()

Let’s reinforce what we’ve seen by far and plot the data considering a monthly frequency
# Capitalizes names
capitalize <- function(x) {
s <- strsplit(x, " ")[[1]]
paste(toupper(substring(s, 1,1)), substring(s, 2),
sep="", collapse=" ")
}
levels <- c("Janeiro", "Fevereiro", "Março", "Abril", "Maio", "Junho", "Julho", "Agosto", "Setembro", "Outubro", "Novembro", "Dezembro")
df %>%
mutate(mes = factor(sapply(months(datahora), capitalize), levels = levels)) %>%
group_by(mes) %>%
ggplot(aes(x = factor(mes), y = 100*..prop.., group = 1)) +
geom_bar()

df %>%
mutate(mes = factor(sapply(months(datahora), capitalize), levels = levels),
ano = factor(year(datahora))) %>%
group_by(ano, mes) %>%
ggplot(aes(x = mes, y = 100*..prop.., group = 1)) +
geom_bar() +
facet_wrap(~ano) + coord_flip()

As we can see, in 2019 the trend was a bit different, maybe it is so because of the popular protests against the liberal and progressive view of brazilian government towards amazonia.
Let’s compact this plot in only one frame:
levels <- c("Jan", "Fev", "Mar", "Abr", "Mai", "Jun", "Jul", "Ago", "Set", "Out", "Nov", "Dez")
df %>%
mutate(mes = factor(sapply(substr(months(df$datahora), 1, 3), capitalize), levels = levels),
ano = factor(year(datahora))) %>%
group_by(ano, mes) %>%
ggplot(aes(x = mes, fill = ano)) +
geom_bar() +
facet_wrap(~ano) +
coord_flip()

df %>%
mutate(mes = factor(sapply(substr(months(df$datahora), 1, 3), capitalize), levels = levels),
ano = factor(year(datahora))) %>%
group_by(ano, mes) %>%
ggplot(aes(x = mes, fill = ano)) +
geom_bar(position = "dodge")

df %>%
mutate(date = date(datahora)) %>%
group_by(date) %>%
ggplot(aes(x = date, group = 1)) +
geom_freqpoly(stat = "count") +
scale_x_date(date_breaks = "6 months", date_labels = "%b/%y")

Strangely, our dataset only has records of fire occurences from 3 to 6 pm. Why is it?
df %>%
mutate(hour = hour(datahora)) %>%
group_by(hour) %>%
ggplot(aes(x = hour, group = 1)) +
geom_bar(stat = "count")

O dataframe original não tem informações sobre macro regiões. Utilizando o data frame de mapas, podemos obter esta informação.
regionsList <- list()
codMacroRegions <- list()
n_reg <- length(unique(map_df$regiao))
regions <- unique(map_df$regiao)
for(i in 1:n_reg){
temp <- map_df %>% filter(regiao == regions[i])
states <- unique(temp$estado)
regionsList[[i]] <- states
codMacroRegions[[i]] <- regions[i]
}
names(regionsList) <- c("NORTE", "NORDESTE", "SUDESTE", "CENTRO-OESTE", "SUL")
for(i in 1:n_reg){
df$macroRegiao[df$estado %in% regionsList[[i]]] <- names(regionsList)[i]
df$codMacroRegiao[df$estado %in% regionsList[[i]]] <- codMacroRegions[[i]]
}
# O nome regiao do dataframe map_df nao tem o mesmo sentido que no dataframe df
# sendo assim, criou-se um dataframe auxiliar com os nomes correspondentes
temp <- df[,c("macroRegiao", "codMacroRegiao")]
names(temp) <- c("macroregiao", "regiao")
temp %>%
group_by(macroregiao, regiao) %>%
summarise(n = n()) %>%
right_join(map_df, by = 'regiao') %>%
ggplot(aes(x = longitude, y = latitude, group = regiao, fill = n/1000)) +
geom_polygon() +
geom_path(color = 'white', size = 0.1) +
scale_fill_continuous(low = "orange",
high = "darkred",
name = 'Number of fires\n(thousands)'
)

na.omit(df) %>% ggplot(aes(x = frp, y = precipitacao, alpha = riscofogo)) +
geom_point()

na.omit(df) %>% ggplot(aes(x = log(frp), y = precipitacao, alpha = riscofogo)) +
geom_point()

Notice the relationship among the fire radiative power, the precipitation and the risk of fire. It seems that, with more precipitation, the frp is reduced while with less precipitation the risk of fire remains mostly
na.omit(df) %>% ggplot(aes(x = riscofogo, y = precipitacao, alpha = frp)) +
geom_point(position = "jitter")

Looking at the distribution of the fire radiative power across different biomas, we can see that it is roughly the same, except for Mata Atlantica and Pampa maybe because Pampa is situated at the south brazil, where temperatures are lower which decreases the chances of fire ocurrences. On the other side, Mata Atlantica is mostly distributed along the coast, implying that the humidity that comes from the ocean may hold fire ocurrences from stroke.
na.omit(df) %>% ggplot(aes(x = bioma, y = log(frp))) +
geom_boxplot()

p <- na.omit(df) %>% ggplot(aes(x = log(frp), col = bioma, fill = bioma)) +
geom_density(alpha = 0.1)
ggplotly(p)
Removed 1 rows containing non-finite values (stat_density).
As we can see below, we cannot take any conclusion about the risk of fire with respect to the fire radiative power. So sad.
na.omit(df) %>%
ggplot(aes(x = riscofogo, y = log(frp), col = bioma)) +
geom_point() + # Copy from Plot 1
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(aes(group = 1), method = "lm", se = FALSE, linetype = 2)

data19 <- df %>% filter(year(datahora) == '2019')
df %>%
select(estado) %>%
group_by(estado) %>%
summarise(n = n()) %>%
right_join(map_df, by = 'estado') %>%
ggplot(aes(x = longitude, y = latitude, group = codigo)) +
geom_polygon() +
geom_point(inherit.aes = F, data = data19, aes(x = longitude, y = latitude, col = bioma), size = 0.1, shape = 3) +
geom_path(color = 'white', size = 0.1)

data_yearly <- df %>% mutate(ano = year(datahora))
df %>%
select(estado, datahora) %>%
mutate(ano = year(datahora)) %>%
group_by(estado, ano) %>%
summarise(n = n()) %>%
right_join(map_df, by = 'estado') %>%
ggplot(aes(x = longitude, y = latitude, group = codigo)) +
geom_polygon() +
geom_point(inherit.aes = F, data = data_yearly, aes(x = longitude, y = latitude, col = bioma), size = 0.05) +
facet_wrap(~ano) +
geom_path(color = 'white', size = 0.1)

#
#
From NasaPower API we can obtain useful data for about 146 weather and climate parameters, such as:
Relative humidty (at two meters) Temperature (at two meters) Precipitation (mm) Maximum/Minimum Monthly Difference From Monthly Averaged All Sky Insolation Wind Speed
geographical data are provided at the resolution of 1/2 arc degree longitude by 1/2 arc degree latitude. for more detailed info go to https://power.larc.nasa.gov/documents/POWER_Data_v9_methodology.pdf
Another useful resource can be found at https://earthobservatory.nasa.gov/images/145464/fires-in-brazil
Also in https://www.globalfiredata.org/data.html we can find data such as carbon emissions and dry matter emissions.
# nasaViirs <- read.csv("./database/DL_FIRE_V1_77994/fire_archive_V1_77994.csv")
# nasaModis <- read.csv("./database/DL_FIRE_M6_78067/fire_archive_M6_78067.csv")
# object_size(nasaModis)
# object_size(nasaViirs)
# View(nasaModis)
# View(nasaViirs)
---
title: "WildFires Analysis - Exploratory Data Analysis"
output: html_notebook
---

```{r warning=FALSE}
# Loading packages
library(pryr) # Memory usage information
library(lubridate) # date manipulation
library(dplyr) # data manipulation
library(tidyr) # data manipulation
library(ggplot2) # graphics
library(plotly)
library(stringr)
library(stringi) # remove diacritics
library(ggfortify)
library(rjson)
library(nasapower)
```


There are 4 csv files containing data about the fires of all ocurrences around all brazilian territory. Each file has data between 1st Sept to 30th Sept of each year from 2015 to 2019.

```{r}
temp <- list.files(path = "./database" ,pattern = "*.csv")
myfiles = lapply(paste0("./database/",temp), read.csv)

nrecords <- lapply(myfiles, nrow)

data <- bind_rows(myfiles)
names(data)

# As we have many data records, we'll be working with a small subset of it
df <- data[sample(nrow(data), 0.1*nrow(data)), ]
totalNo <- nrow(df)

object_size(df)
```

there are 820929 observations in our dataset which sums up to 66.8 MB of storaged data.

```{r}
head(df)
```

```{r}
str(df)
```

Aparentemente, A coluna de área industrial e FRP tem poucas entradas não nulas


```{r}
paste("Percentage of NAs entries on `FRP` =",
      100*sum(is.na(df$frp))/totalNo)

wetDays <- df$diasemchuva <= 10
dryDays <- df$diasemchuva > 10
noWeatherInfo <- is.na(df$diasemchuva)

paste("Percentage of NAs entries on `diasemchuva` =",
      round(100*(sum(noWeatherInfo))/totalNo,2))

paste("Percentage of records with less than (or equal to) 10 days of no rain =",
      round(100*(sum(wetDays, na.rm = T))/totalNo,2))

paste("Percentage of records with more than 10 days of no rain =",
      round(100*(sum(dryDays, na.rm = T))/totalNo,2))
```

By the last results, it is possible that the number of days with no rain have a huge impact over the fires triggers.

we can also investigate another columns that have either redundant data or not useful data.

```{r}
levels(df$pais)
levels(df$satelite)

df$pais <- NULL
df$satelite <- NULL

object_size(df)
```

By looking at the structure of our dataframe, it becomes clear that the `Datahora` is a datetime object not a Factor.

```{r}
df$datahora <- ymd_hms(str_replace_all(df$datahora,"/","-"))

levels(df$bioma)
```

We have 6 distincts biomas on our dataset. Perhaps, by knowing how is the distribution of fires among different biomas, we can refine our research.

```{r}
counts <- table(df$bioma)
counts

proportions <- 100*counts / sum(counts)
proportions
```

As we can see, Amazonia represents around half of our dataset. In addition, the number of fire occurrences is very much like the proportion of land that each bioma has in brazilian territory.

-------------------------- Visualisations -----------------------------

```{r, fig.width = 4, fig.height= 3}
ggplot(df) + 
    geom_bar(aes(x = bioma, y = 100*..prop.., group = 1)) +
    coord_flip()
```

Now, it is also important to get the underlying trend of the number of fire ocurrences through the months of the year. Let's start exploring this idea:

```{r}
df %>% 
  mutate(month = floor_date(date(datahora), "month")) %>% 
  group_by(month, bioma) %>% 
  summarise(n_fires = n()) %>% 
  ggplot(aes(x = month, y = n_fires/1000)) +
  geom_bar(aes(fill = bioma), stat = 'identity') + 
    scale_x_date(date_labels = "%b/%y", date_breaks = "6 months") +
    labs(y = "Number of fires in thousands", title = 'Brazilian Wildfires by Year')
```

In order to plot the number of fires by state, we must do some data wrangling to put the data in the desired format.

```{r}

is.odd <- function(x) x %% 2 != 0
is.even <- function(x) x %% 2 == 0

geo_data <- fromJSON(file = "./geo_data/geo_data.json")

maps_df <- list()
  for (i in 1:length(geo_data$features)) {
    coordinates <- unlist(geo_data$features[[i]]$geometry$coordinates[[1]])
    long <- coordinates[is.odd(seq_along(coordinates))]
    lat <-  coordinates[is.even(seq_along(coordinates))]
    
    if(length(long) != length(lat)){
      stop("Wrong subset of coordinate vector", call. = FALSE)
    }
    
    geo_length <- length(long)
    
    estado <- toupper(geo_data$features[[i]]$properties$name)
    codigo <- as.integer(geo_data$features[[i]]$properties$codigo_ibg)
    regiao <- as.integer(geo_data$features[[i]]$properties$regiao_id)
    
    maps_df[[i]] <- as_tibble(
      list(
        estado = rep(stri_trans_general(estado,"Latin-ASCII"), geo_length),
        codigo = rep(codigo, geo_length),
        regiao = rep(regiao, geo_length),
        longitude = long,
        latitude = lat
      )
    )
  }
  
  map_df <- bind_rows(maps_df)
```

```{r}
# the `estado` column of the map dataframe is a character, so in order
# to avoid coertion from factor to character, let's convert it explicitly
df$estado <- as.character(df$estado)
```

```{r}
df %>% 
  select(estado) %>% 
  group_by(estado) %>% 
  summarise(n = n()) %>% 
  right_join(map_df, by = 'estado') %>% 
  ggplot(aes(x = longitude, y = latitude, group = codigo, fill = n/1000)) + 
    geom_polygon() + 
    geom_path(color = 'white', size = 0.1) + 
    scale_fill_continuous(low = "orange", 
                          high = "darkred",
                          name = 'Number of fires\n(thousands)'
                          ) 
```

As we can see, the state with the most fire ocurrences is Pará in which most of its lands belongs to the amazon ecossystem, this may explain why half the dataset ocurrences comes from Amazon fires.

As we can see below, this trend happens all years.

```{r}
df %>% 
  select(estado, datahora) %>% 
  mutate(ano = year(datahora)) %>% 
  group_by(estado, ano) %>% 
  summarise(n = n()) %>% 
  right_join(map_df, by = 'estado') %>% 
  ggplot(aes(x = longitude, y = latitude, group = codigo, fill = n/1000)) + 
    geom_polygon() +
    facet_wrap(~ano) +
    geom_path(color = 'white', size = 0.1) + 
    scale_fill_continuous(low = "orange", 
                          high = "darkred",
                          name = 'Number of fires\n(thousands)'
                          ) 
```

In terms of raw numbers, we can visualize this difference below.

```{r}
df %>% 
  group_by(estado) %>% 
  ggplot(aes(x = factor(estado))) +
    geom_bar() +
    coord_flip()
```

Going down one level, let's try an analysis by county

```{r}
geo_data_counties <- fromJSON(file = "./geo_data/geo_data_counties.json")

maps_df_counties <- list()
  for (i in 1:length(geo_data_counties$features)) {
    coordinates <- unlist(geo_data_counties$features[[i]]$geometry$coordinates[[1]])
    long <- coordinates[is.odd(seq_along(coordinates))]
    lat <-  coordinates[is.even(seq_along(coordinates))]
    
    if(length(long) != length(lat)){
      stop("Wrong subset of coordinate vector", call. = FALSE)
    }
    
    geo_length <- length(long)
    
    municipio <- geo_data_counties$features[[i]]$properties$name
    municipio <- toupper(stri_trans_general(municipio, "Latin-ASCII"))
    codigo <- as.integer(geo_data_counties$features[[i]]$properties$id)
    
    maps_df_counties[[i]] <- as_tibble(
      list(
        municipio = rep(stri_trans_general(municipio, "Latin-ASCII"), geo_length),
        codigo = rep(codigo, geo_length),
        longitude = long,
        latitude = lat
      )
    )
  }
  
  map_df_county <- bind_rows(maps_df_counties)
```

```{r}
df %>% 
  select(municipio) %>% 
  group_by(municipio) %>% 
  summarise(n = n()) %>% 
  right_join(map_df_county, by = 'municipio') %>% 
  ggplot(aes(x = longitude, y = latitude, group = codigo, fill = n)) + 
    geom_polygon() + 
    geom_path(color = 'white', size = 0.1) +
    scale_fill_continuous(low = "orange", 
                          high = "darkred",
                          name = 'Number of fires'
                          ) 
```

What about that total number of fire ocurrences by state? From when does it come from?

```{r}
df %>% 
  mutate(ano = year(datahora)) %>% 
  group_by(estado, ano) %>% 
  ggplot(aes(x = factor(estado), fill = factor(ano))) +
    geom_bar() +
    coord_flip()
```

Let's repeat the last plot, but now considering the proportions

```{r}
df %>% 
  mutate(ano = year(datahora)) %>% 
  group_by(estado, ano) %>% 
  ggplot(aes(x = factor(estado), fill = factor(ano))) +
    geom_bar(position = "fill") +
    coord_flip()
```

Let's reinforce what we've seen by far and plot the data considering a monthly frequency

```{r}
# Capitalizes names

capitalize <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1,1)), substring(s, 2),
      sep="", collapse=" ")
}
```


```{r}
levels <- c("Janeiro", "Fevereiro", "Março", "Abril", "Maio", "Junho", "Julho", "Agosto", "Setembro", "Outubro", "Novembro", "Dezembro")

df %>% 
  mutate(mes = factor(sapply(months(datahora), capitalize), levels = levels)) %>%
  group_by(mes) %>%
  ggplot(aes(x = factor(mes), y = 100*..prop.., group = 1)) +
    geom_bar()

df %>% 
  mutate(mes = factor(sapply(months(datahora), capitalize), levels = levels),
         ano = factor(year(datahora))) %>%
  group_by(ano, mes) %>%
  ggplot(aes(x = mes, y = 100*..prop.., group = 1)) +
    geom_bar() + 
    facet_wrap(~ano) + coord_flip()
```

As we can see, in 2019 the trend was a bit different, maybe it is so because of the popular protests against the liberal and progressive view of brazilian government towards amazonia.

Let's compact this plot in only one frame:

```{r}

levels <- c("Jan", "Fev", "Mar", "Abr", "Mai", "Jun", "Jul", "Ago", "Set", "Out", "Nov", "Dez")

df %>% 
  mutate(mes = factor(sapply(substr(months(df$datahora), 1, 3), capitalize), levels = levels),
         ano = factor(year(datahora))) %>%
  group_by(ano, mes) %>%
  ggplot(aes(x = mes, fill = ano)) +
    geom_bar() + 
    facet_wrap(~ano) +
    coord_flip()

df %>% 
  mutate(mes = factor(sapply(substr(months(df$datahora), 1, 3), capitalize), levels = levels),
         ano = factor(year(datahora))) %>%
  group_by(ano, mes) %>%
  ggplot(aes(x = mes, fill = ano)) +
    geom_bar(position = "dodge")

```

```{r}
df %>% 
  mutate(date = date(datahora)) %>% 
  group_by(date) %>% 
  ggplot(aes(x = date, group = 1)) +
   geom_freqpoly(stat = "count") +
   scale_x_date(date_breaks = "6 months", date_labels = "%b/%y")
```

Strangely, our dataset only has records of fire occurences from 3 to 6 pm. Why is it?

```{r}
df %>% 
  mutate(hour = hour(datahora)) %>% 
  group_by(hour) %>% 
  ggplot(aes(x = hour, group = 1)) +
   geom_bar(stat = "count")
```

O dataframe original não tem informações sobre macro regiões. Utilizando o data frame de mapas, podemos obter esta informação.

```{r}

regionsList <- list()
codMacroRegions <- list()
n_reg <- length(unique(map_df$regiao))
regions <- unique(map_df$regiao)

for(i in 1:n_reg){
  temp <- map_df %>% filter(regiao == regions[i])
  
  states <- unique(temp$estado)
  regionsList[[i]] <- states
  codMacroRegions[[i]] <- regions[i]
}

names(regionsList) <- c("NORTE", "NORDESTE", "SUDESTE", "CENTRO-OESTE", "SUL")

for(i in 1:n_reg){
  df$macroRegiao[df$estado %in% regionsList[[i]]] <- names(regionsList)[i]
  df$codMacroRegiao[df$estado %in% regionsList[[i]]] <- codMacroRegions[[i]]
}
```

```{r}

# O nome regiao do dataframe map_df nao tem o mesmo sentido que no dataframe df
# sendo assim, criou-se um dataframe auxiliar com os nomes correspondentes
temp <- df[,c("macroRegiao", "codMacroRegiao")]
names(temp) <- c("macroregiao", "regiao")

temp %>% 
  group_by(macroregiao, regiao) %>% 
  summarise(n = n()) %>% 
  right_join(map_df, by = 'regiao') %>% 
  ggplot(aes(x = longitude, y = latitude, group = regiao, fill = n/1000)) + 
    geom_polygon() + 
    geom_path(color = 'white', size = 0.1) + 
    scale_fill_continuous(low = "orange", 
                          high = "darkred",
                          name = 'Number of fires\n(thousands)'
                          ) 
```



```{r}
na.omit(df) %>% ggplot(aes(x = frp, y = precipitacao, alpha = riscofogo)) +
  geom_point()

na.omit(df) %>% ggplot(aes(x = log(frp), y = precipitacao, alpha = riscofogo)) +
  geom_point()
```

Notice the relationship among the fire radiative power, the precipitation and the risk of fire. It seems that, with more precipitation, the frp is reduced while with less precipitation the risk of fire remains mostly

```{r}
na.omit(df) %>% ggplot(aes(x = riscofogo, y = precipitacao, alpha = frp)) +
  geom_point(position = "jitter")
```

Looking at the distribution of the fire radiative power across different biomas, we can see that it is roughly the same, except for Mata Atlantica and Pampa maybe because Pampa is situated at the south brazil, where temperatures are lower which decreases the chances of fire ocurrences. On the other side, Mata Atlantica is mostly distributed along the coast, implying that the humidity that comes from the ocean may hold fire ocurrences from stroke.

```{r}
na.omit(df) %>% ggplot(aes(x = bioma, y = log(frp))) +
  geom_boxplot()
```

```{r}
p <- na.omit(df) %>% ggplot(aes(x = log(frp), col = bioma, fill = bioma)) +
  geom_density(alpha = 0.1)

ggplotly(p)
```

As we can see below, we cannot take any conclusion about the risk of fire with respect to the fire radiative power. So sad.

```{r}
na.omit(df) %>% 
  ggplot(aes(x = riscofogo, y = log(frp), col = bioma)) +
    geom_point() + # Copy from Plot 1
    geom_smooth(method = "lm", se = FALSE) +
    geom_smooth(aes(group = 1), method = "lm", se = FALSE, linetype = 2) 
```



```{r}
data19 <- df %>% filter(year(datahora) == '2019')

df %>% 
  select(estado) %>% 
  group_by(estado) %>% 
  summarise(n = n()) %>% 
  right_join(map_df, by = 'estado') %>% 
  ggplot(aes(x = longitude, y = latitude, group = codigo)) + 
    geom_polygon() + 
    geom_point(inherit.aes = F, data = data19, aes(x = longitude, y = latitude, col = bioma), size = 0.1, shape = 3) +
    geom_path(color = 'white', size = 0.1)
```

```{r}
data_yearly <- df %>% mutate(ano = year(datahora))

df %>%
  select(estado, datahora) %>%
  mutate(ano = year(datahora)) %>%
  group_by(estado, ano) %>% 
  summarise(n = n()) %>% 
  right_join(map_df, by = 'estado') %>%
  ggplot(aes(x = longitude, y = latitude, group = codigo)) +
    geom_polygon() +
    geom_point(inherit.aes = F, data = data_yearly, aes(x = longitude, y = latitude, col = bioma), size = 0.05) +
    facet_wrap(~ano) +
    geom_path(color = 'white', size = 0.1)
# 
# 
```



From NasaPower API we can obtain useful data for about 146 weather and climate parameters, such as:

Relative humidty (at two meters)
Temperature (at two meters)
Precipitation (mm)
Maximum/Minimum Monthly Difference From Monthly Averaged All Sky Insolation
Wind Speed 

geographical data are provided at the resolution of 1/2 arc degree longitude by 1/2 arc degree latitude. for more detailed info go to https://power.larc.nasa.gov/documents/POWER_Data_v9_methodology.pdf

Another useful resource can be found at https://earthobservatory.nasa.gov/images/145464/fires-in-brazil

Also in https://www.globalfiredata.org/data.html we can find data such as carbon emissions and dry matter emissions.

```{r}
# nasaViirs <- read.csv("./database/DL_FIRE_V1_77994/fire_archive_V1_77994.csv")
# nasaModis <-  read.csv("./database/DL_FIRE_M6_78067/fire_archive_M6_78067.csv")
# object_size(nasaModis)
# object_size(nasaViirs)
# View(nasaModis)
# View(nasaViirs)
```






